Haplotype Assembly: An Information Theoretic View
This paper studies the haplotype assembly problem from an information
theoretic perspective. A haplotype is a sequence of nucleotide bases on a
chromosome, often conveniently represented by a binary string, that differs from
the bases in the corresponding positions on the other chromosome in a
homologous pair. Information about the order of bases in a genome is readily
inferred using short reads provided by high-throughput DNA sequencing
technologies. In this paper, the recovery of the target pair of haplotype
sequences using short reads is rephrased as a joint source-channel coding
problem. Two messages, representing haplotypes and chromosome memberships of
reads, are encoded and transmitted over a channel with erasures and errors,
where the channel model reflects salient features of high-throughput
sequencing. The focus of this paper is on the required number of reads for
reliable haplotype reconstruction, and both the necessary and sufficient
conditions are presented with order-wise optimal bounds. Comment: 30 pages, 5 figures, 1 table, journa
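The channel model described in this abstract can be sketched in a few lines of Python. This is only an illustrative simulation under stated assumptions, not the authors' exact model: the two haplotypes are taken to be binary complements of each other, each read picks its chromosome uniformly at random, and each position is independently erased with probability `erasure_prob` and otherwise flipped with probability `error_prob`. The function name and all parameter names are hypothetical.

```python
import random


def simulate_reads(haplotype, num_reads, erasure_prob, error_prob, rng):
    """Simulate reads of a haplotype pair through an erasure-and-error channel.

    Assumption (not from the abstract): the second haplotype is the bitwise
    complement of the first, and chromosome memberships are uniform.
    Returns (reads, memberships); an erased position is stored as None.
    """
    reads = []
    memberships = []
    for _ in range(num_reads):
        membership = rng.randint(0, 1)  # which chromosome the read comes from
        source = haplotype if membership == 0 else [1 - b for b in haplotype]
        read = []
        for bit in source:
            if rng.random() < erasure_prob:
                read.append(None)          # position not covered by this read
            elif rng.random() < error_prob:
                read.append(1 - bit)       # sequencing error flips the bit
            else:
                read.append(bit)
        reads.append(read)
        memberships.append(membership)
    return reads, memberships


# Example: 6 reads of a length-8 haplotype, mostly erased, with rare errors.
example_reads, example_memberships = simulate_reads(
    [0, 1, 1, 0, 1, 0, 0, 1], 6, 0.7, 0.05, random.Random(0)
)
```

Recovering both the haplotype and the memberships from such reads is the decoding problem the abstract refers to; the sketch only generates the channel output.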
Coding mechanisms for communication and compression : analysis of wireless channels and DNA sequencing
This thesis comprises two related but distinct components: coding arguments for communication channels and an information-theoretic analysis of haplotype assembly. The common thread for both problems is the use of information- and coding-theoretic principles to understand their underlying mechanisms. For the first class of problems, I study two practical challenges that prevent optimal discrete codes from being used in real communication and compression systems, namely coding over analog alphabets and fading. In particular, I use an expansion coding scheme to convert the original analog channel coding and source coding problems into a set of independent discrete subproblems. By adopting optimal discrete codes over the expanded levels, this low-complexity coding scheme can approach the Shannon limit perfectly or in ratio. Meanwhile, I design a polar coding scheme to deal with the unstable state of fading channels. This novel coding mechanism, which hierarchically uses different types of polar codes, has been proved to achieve the ergodic capacity of several fading systems without channel state information at the transmitter. For the second class of problems, I build an information-theoretic view of haplotype assembly. More precisely, the recovery of the target pair of haplotype sequences using short reads is rephrased as a joint source-channel coding problem. Two binary messages, representing haplotypes and chromosome memberships of reads, are encoded and transmitted over a channel with erasures and errors, where the channel model reflects salient features of high-throughput sequencing. The focus is on determining the required number of reads for reliable haplotype reconstruction. Electrical and Computer Engineerin
Polar Coding for Fading Channels
A polar coding scheme for fading channels is proposed in this paper. More
specifically, the focus is on the Gaussian fading channel with a BPSK modulation
technique, where the equivalent channel could be modeled as a binary symmetric
channel with varying cross-over probabilities. To deal with variable channel
states, a coding scheme of hierarchically utilizing polar codes is proposed. In
particular, by observing the polarization of different binary symmetric
channels over different fading blocks, each channel use corresponding to a
different polarization is modeled as a binary erasure channel such that polar
codes could be adopted to encode over blocks. It is shown that the proposed
coding scheme, without instantaneous channel state information at the
transmitter, achieves the capacity of the corresponding fading binary symmetric
channel, which is constructed from the underlying fading AWGN channel through
the modulation scheme. Comment: 6 pages, 4 figures, conferenc
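Two quantities behind this abstract can be illustrated with a short Python sketch. The first function computes the ergodic capacity of a fading binary symmetric channel as the average of the per-state capacities 1 - H(p), assuming the receiver knows the fading state. The second applies the standard one-step polarization transform of a binary erasure channel, which maps erasure probability e to the pair (2e - e^2, e^2). The fading states used in the example are hypothetical values, and the sketch does not reproduce the hierarchical scheme proposed in the paper.

```python
import math


def binary_entropy(p):
    """Binary entropy H(p) in bits, with H(0) = H(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def bsc_capacity(crossover):
    """Capacity of a binary symmetric channel with the given crossover probability."""
    return 1.0 - binary_entropy(crossover)


def ergodic_bsc_capacity(states):
    """Average BSC capacity over fading states.

    states: list of (state_probability, crossover_probability) pairs.
    """
    return sum(q * bsc_capacity(p) for q, p in states)


def polarize_bec(erasure, levels):
    """Erasure probabilities of the 2**levels synthetic channels of a BEC."""
    probs = [erasure]
    for _ in range(levels):
        # Each channel splits into a degraded and an upgraded version.
        probs = [z for e in probs for z in (2.0 * e - e * e, e * e)]
    return probs


# Hypothetical two-state fading: a good state and a bad state, equally likely.
example_capacity = ergodic_bsc_capacity([(0.5, 0.01), (0.5, 0.2)])
example_polarized = polarize_bec(0.5, 3)
```

The polarization step preserves the average erasure probability, so the fraction of nearly noiseless synthetic channels tends to the capacity of the erasure channel as the number of levels grows.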
Lossy Compression of Exponential and Laplacian Sources using Expansion Coding
A general method of source coding over expansion is proposed in this paper,
which enables one to reduce the problem of compressing an analog
(continuous-valued) source to a set of much simpler problems, compressing
discrete sources. Specifically, the focus is on lossy compression of
exponential and Laplacian sources, which are subsequently expanded using a
finite alphabet prior to being quantized. Due to the decomposability property of
such sources, the resulting random variables post expansion are independent and
discrete. Thus, each of the expanded levels corresponds to an independent
discrete source coding problem, and the original problem is reduced to coding
over these parallel sources with a total distortion constraint. Any feasible
solution to the optimization problem is an achievable rate distortion pair of
the original continuous-valued source compression problem. Although finding the
solution to this optimization problem at every distortion is hard, we show that
our expansion coding scheme presents a good solution in the low distortion
regime. Further, by adopting low-complexity codes designed for discrete source
coding, the total coding complexity can be tractable in practice. Comment: 8 pages, 3 figure
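The decomposability property mentioned in this abstract can be illustrated for the exponential source with a binary expansion. The sketch below assumes a binary (base-2) expansion, which the abstract does not specify: for an exponential variable with rate `rate`, the bit at level l of the expansion is a Bernoulli variable with parameter 1 / (1 + exp(rate * 2^l)), and the bits are independent across levels. The function names are hypothetical, and the code only checks that the weighted bit means add up to the source mean 1 / rate.

```python
import math


def bit_probability(rate, level):
    """P(bit at the given level is 1) for an exponential source with this rate.

    Uses the numerically stable form exp(-a) / (1 + exp(-a)), a = rate * 2**level,
    which equals 1 / (1 + exp(a)) without overflowing for large levels.
    """
    a = rate * 2.0 ** level
    t = math.exp(-a)
    return t / (1.0 + t)


def expansion_mean(rate, low, high):
    """Mean of the expansion truncated to levels low..high (inclusive)."""
    return sum(2.0 ** l * bit_probability(rate, l) for l in range(low, high + 1))


# With enough levels the truncated expansion recovers the source mean 1 / rate.
example_mean = expansion_mean(1.0, -30, 12)
```

Each level can then be treated as an independent binary source coding problem, which is the reduction the abstract describes; choosing the per-level distortions is the optimization problem it refers to.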
Multi-Scenario Ranking with Adaptive Feature Learning
Recently, Multi-Scenario Learning (MSL) has been widely used in recommendation and
retrieval systems in the industry because it facilitates transfer learning from
different scenarios, mitigating data sparsity and reducing maintenance cost.
These efforts produce different MSL paradigms by searching for better network
structures, such as Auxiliary Network, Expert Network, and Multi-Tower Network.
It is intuitive that different scenarios could hold their specific
characteristics, activating the user's intents quite differently. In other
words, different kinds of auxiliary features would bear varying importance
under different scenarios. With more discriminative feature representations
refined in a scenario-aware manner, better ranking performance could be easily
obtained without expensive search for the optimal network structure.
Unfortunately, this simple idea is largely overlooked, though much desired, in
real-world systems. Further analysis also validates the rationality of adaptive
feature learning under a multi-scenario scheme. Moreover, our A/B test results
on the Alibaba search advertising platform also demonstrate that Maria is
superior in production environments. Comment: 10 pages